Domain Specific Word Extraction from Hierarchical Web Documents: A First Step Toward Building Lexicon Trees from Web Corpora

Author

  • Jing-Shin Chang
Abstract

Domain specific words and ontological information among words are important resources for general natural language applications. This paper proposes a statistical model for finding domain specific words (DSWs) in particular domains, and thus building the association among them. When this model is applied node by node to the hierarchical structure of web directories, the document tree can potentially be converted into a large, semantically annotated lexicon tree. Preliminary results show that the current approach is better than a conventional TF-IDF approach for measuring domain specificity. An average precision of 65.4% and an average recall of 36.3% are observed if the top-10% candidates are extracted as domain-specific words.

1 Domain Specific Words and Lexicon Trees as Important NLP Resources

Domain specific words (DSWs) are important anchoring words for natural language processing applications that involve word sense disambiguation (WSD). It has been observed that multi-sense words appearing in the same document tend to be tagged with the same word sense if they belong to the same common domain in the semantic hierarchy (Yarowsky, 1995). The presence of some DSWs in a document is therefore strong evidence of a specific sense for words within the document. For instance, the presence of basketball in a document would strongly suggest the sport sense of the word (Pistons), rather than its mechanics sense. The author also believes that DSW-based sense disambiguation, document classification and many similar applications would be easier than sense-based models, since sense-tagged documents are rare while domain-aware training documents are abundant on the Web. DSW identification is therefore an important issue.
On the other hand, the semantic hierarchy among words (especially among sets of domain specific words), as well as the membership of domain specific words, is also an important resource for general natural language processing applications: the hierarchy provides semantic links and ontological information (such as is-a and part-of relationships) for words, and domain specific words belonging to the same domain may stand in synonym or antonym relationships. A hierarchical lexicon tree (or, more generally, a network) (Fellbaum, 1998; Jurafsky and Martin, 2000), indicating sets of highly associated domain specific words and their hierarchy, is therefore invaluable for NLP applications. Manually constructing such a lexicon hierarchy and acquiring the associated words for each node, however, is most likely unaffordable in terms of both time and cost. In addition, new words (or new usages of words) are produced day by day. For instance, the Chinese word (pistons) is more frequently used in the sport or basketball sense (referring to the Detroit
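
The excerpt above does not describe the proposed statistical model, so the following is only a minimal sketch of the conventional TF-IDF baseline and the evaluation setup mentioned in the abstract: rank candidate words by a domain-specificity score, keep a top fraction, and measure precision and recall against a reference list. Treating each domain node as a single "document", the toy corpus, the gold list, and all function names are assumptions made for illustration, not the paper's data or method.

```python
import math
from collections import Counter

def tf_idf_scores(domain_tokens, target):
    """TF-IDF score for every word in the `target` domain.

    Assumption: each domain node is treated as one document, so the
    document frequency of a word is the number of domains containing it.
    """
    n_domains = len(domain_tokens)
    df = Counter()
    for tokens in domain_tokens.values():
        df.update(set(tokens))          # count each word once per domain
    tf = Counter(domain_tokens[target])
    total = sum(tf.values())
    return {w: (c / total) * math.log(n_domains / df[w]) for w, c in tf.items()}

def top_fraction(scores, fraction=0.10):
    """Return the highest-scoring `fraction` of candidates (at least one)."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))  # ties: alphabetical
    k = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:k]

def precision_recall(extracted, gold):
    """Precision and recall of an extracted word list against a gold list."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy corpus: three domain nodes with a few tokens each.
domains = {
    "sports": "pistons win basketball game the pistons score the game".split(),
    "mechanics": "pistons move in the engine cylinder the engine".split(),
    "news": "the game of the year in the city".split(),
}
gold_sports = {"basketball", "pistons", "score"}  # hypothetical reference list

scores = tf_idf_scores(domains, "sports")
extracted = top_fraction(scores, fraction=0.5)    # 0.5 only because the toy set is tiny
precision, recall = precision_recall(extracted, gold_sports)
print(extracted, round(precision, 3), round(recall, 3))
```

In this toy setup a word shared by two domains, such as "pistons", receives a lower TF-IDF score than words unique to one domain, which illustrates why a plain TF-IDF ranking is only a rough proxy for domain specificity.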

Similar resources

Mining Domain Specific Words from Web Documents

Web pages provide not only plain text materials for training language models but also tag information for semantics annotation. The tags could be found either explicitly in the HTML documents or implicitly through the directory hierarchy of the documents, since the directory hierarchy can be regarded as a kind of classification tree for web documents, which assigns an implicit hidden tag to eac...

Two-Step Flow in Bilingual Lexicon Extraction from Unrelated Corpora

This paper presents a language independent methodology for automatically extracting bilingual lexicon entries from the web without the need of resources like parallel or comparable corpora, POS tagging, nor an initial bilingual lexicon. It is suitable for specialized domains where bilingual lexicon entries are scarce. The input for the process is a corpus in the source language to use as exampl...

Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

Due to the growth of the web, there are many challenges in establishing a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...

Building a fine-grained subjectivity lexicon from a web corpus

In this paper we propose a method to build fine-grained subjectivity lexicons including nouns, verbs and adjectives. The method, which is applied for Dutch, is based on the comparison of word frequencies of three corpora: Wikipedia, News and News comments. Comparison of the corpora is carried out with two measures: log-likelihood ratio and a percentage difference calculation. The first step of ...

Mining and Exploiting Domain-Specific Corpora in the

The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stages involved in the acquisition, production, updating and maintenance of the large language resources required by, among others, MT systems. The development of a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web is one of the most innovative building bloc...

Publication date: 2005